1 Time Series Regression for Predicting Macroeconomic Indicators

In this homework, I extend my earlier analyses of data from the Data Delivery System (https://evds2.tcmb.gov.tr/). Specifically, I want to forecast the Consumer Price Index (CPI) of Clothing and Footwear at a monthly level. Before fitting a time series regression, I first examine the characteristics of the target variable (CPI of Clothing and Footwear) and select the independent variables that will go into the model.

1.1 Introduction

First, the data are obtained from the source mentioned above. The reading and pre-processing steps (and all the code in this work) can be viewed by clicking the Code box at the top right corner of each section. Let’s start with the target variable, the Consumer Price Index of Clothing and Footwear. For that, I use ggplot2 and some other time-series and visualization packages.

##        Date Clothing Footwear US_DOLLAR Personal_INTEREST_RATE CCI
##  1: 2008-06   128.23   138.33      1.23                  21.25  NA
##  2: 2008-07   117.98   127.25      1.21                  21.71  NA
##  3: 2008-08   110.59   120.43      1.17                  22.16  NA
##  4: 2008-09   109.40   127.12      1.23                  21.23  NA
##  5: 2008-10   118.68   137.80      1.47                  22.34  NA
##  6: 2008-11   121.98   140.81      1.59                  25.09  NA
##  7: 2008-12   116.98   136.09      1.54                  24.47  NA
##  8: 2009-01   107.10   125.79      1.59                  22.86  NA
##  9: 2009-02   101.11   118.84      1.65                  21.47  NA
## 10: 2009-03   100.87   120.03      1.70                  21.80  NA
##     Household_Fin_CCI CCI-General CCI_Semi_durable
##  1:                NA          NA               NA
##  2:                NA          NA               NA
##  3:                NA          NA               NA
##  4:                NA          NA               NA
##  5:                NA          NA               NA
##  6:                NA          NA               NA
##  7:                NA          NA               NA
##  8:                NA          NA               NA
##  9:                NA          NA               NA
## 10:                NA          NA               NA
colnames(hw3_data)
## [1] "Date"                   "Clothing"               "Footwear"              
## [4] "US_DOLLAR"              "Personal_INTEREST_RATE" "CCI"                   
## [7] "Household_Fin_CCI"      "CCI-General"            "CCI_Semi_durable"
summary(hw3_data)
##      Date              Clothing        Footwear       US_DOLLAR    
##  Length:151         Min.   :100.9   Min.   :118.8   Min.   :1.170  
##  Class :character   1st Qu.:130.1   1st Qu.:142.8   1st Qu.:1.720  
##  Mode  :character   Median :159.5   Median :181.0   Median :2.220  
##                     Mean   :168.2   Mean   :191.9   Mean   :3.041  
##                     3rd Qu.:199.8   3rd Qu.:225.6   3rd Qu.:3.750  
##                     Max.   :262.8   Max.   :308.1   Max.   :8.000  
##                     NA's   :1       NA's   :1                      
##  Personal_INTEREST_RATE      CCI        Household_Fin_CCI  CCI-General    
##  Min.   :10.61          Min.   :77.05   Min.   :64.90     Min.   : 59.80  
##  1st Qu.:14.71          1st Qu.:82.95   1st Qu.:77.17     1st Qu.: 78.35  
##  Median :17.03          Median :89.89   Median :82.45     Median : 88.66  
##  Mean   :17.87          Mean   :88.46   Mean   :80.30     Mean   : 85.11  
##  3rd Qu.:19.60          3rd Qu.:92.54   3rd Qu.:84.50     3rd Qu.: 94.48  
##  Max.   :38.72          Max.   :97.37   Max.   :88.07     Max.   :104.75  
##                         NA's   :43      NA's   :43        NA's   :43      
##  CCI_Semi_durable
##  Min.   : 95.49  
##  1st Qu.:105.83  
##  Median :109.50  
##  Mean   :108.63  
##  3rd Qu.:111.64  
##  Max.   :117.35  
##  NA's   :43

From the initial inspection, we observe that part of the CCI (Consumer Confidence Index) data is missing: each CCI series has 43 missing values. We will handle this later if we use these series as independent variables. Apart from that, every series runs from 2008-06 to 2020-12, except the Clothing and Footwear CPI series, which have one missing month each.

The variables I have chosen for my initial analysis are:

  • “Clothing” (Clothing CPI),
  • “US_DOLLAR” (US Dollar Exchange Rate),
  • “Personal_INTEREST_RATE” (Personal Weighted Average Interest Rates for Bank Loans),
  • “Household_Fin_CCI” (Financial situation of household),
  • “CCI-General” (Seasonally unadjusted Consumer Confidence Index on the general economic situation),
  • “CCI_Semi_durable” (Seasonally unadjusted Consumer Confidence Index on spending money on semi-durable goods).

str(hw3_data)
## Classes 'data.table' and 'data.frame':   151 obs. of  9 variables:
##  $ Date                  : chr  "2008-06" "2008-07" "2008-08" "2008-09" ...
##  $ Clothing              : num  128 118 111 109 119 ...
##  $ Footwear              : num  138 127 120 127 138 ...
##  $ US_DOLLAR             : num  1.23 1.21 1.17 1.23 1.47 1.59 1.54 1.59 1.65 1.7 ...
##  $ Personal_INTEREST_RATE: num  21.2 21.7 22.2 21.2 22.3 ...
##  $ CCI                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Household_Fin_CCI     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CCI-General           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ CCI_Semi_durable      : num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

Our Date column is stored as character. First, let’s convert it to Date format:

hw3_data$Date<-parse_date_time(hw3_data[,Date], "Ym")
hw3_data[,Date:=as.Date(Date,format='%Y-%m-%d')]

2 Understanding Clothing and Footwear CPI

To better understand our two candidate dependent variables, we plot them:

ts_data<-ts(hw3_data,start = c(2008, 6),frequency=12)
ggplot<-ggplot2::autoplot(ts_data[,c("Clothing","Footwear")]) +
  theme_classic()+
  labs( x="Date",y="% CPI of Clothing and Footwear",title=("CPI of Footwear and Clothing 2008-2020"))
ggplotly(ggplot)

From the graph, we observe that the two indicators are, not surprisingly, strongly correlated. The CPI of footwear has been higher than that of clothing in every year. A seasonal (cyclical) pattern and a positive trend are also visible in both indicators.

Below, you can see the positive trend better:

ggplot<-ggplot2::autoplot(ts_data[,c("Clothing","Footwear")]) +
  theme_classic()+
  labs( x="Date",y="% CPI of Clothing and Footwear",title=("CPI of Footwear and Clothing 2008-2020"))+
  geom_smooth(method = "lm")
ggplotly(ggplot)

Since we have several candidate independent variables, let’s look at how they behave over time with the help of the zoo package:

ts_data1 <- ts_data[,-1]
colnames(ts_data1)<-c("Clothing","Footwear","US_Dollar","Per_int_rate","CCI","Household_CCI","CCI-General","CCI_semi_durable")

plot(zoo(ts_data1))

From the graphs above, we can see the general trends and patterns over the years. Unsurprisingly, US_Dollar and the CPI of footwear and clothing have similar positive trends. Apart from the seasonality, US_Dollar could be a good predictor for our target variable. The CCI series, on the other hand, behave differently: they drop sharply in the last months of 2018. In contrast, the personal interest rate rises sharply over the same period. However, it is hard to conclude from the plots alone whether these variables have a significant effect on our target variable.

3 Correlation Analysis

To conduct a successful regression analysis, one must pay attention to the correlation between the variables. Multicollinearity, that is, strong correlation among the predictors, can be a problem when we fit the model and interpret the results. First, let’s see the correlation between the variables:

hw3_data_complete<-hw3_data[complete.cases(hw3_data), ]

M<-cor(hw3_data_complete[,-1])
corrplot(M, method="number")

Here, we can see that Clothing, Footwear and the US dollar exchange rate are strongly correlated. Similarly, CCI-General and the Household Financial Status CCI are strongly correlated, so we should avoid using both of them together in our regression analysis (see the multicollinearity problem above). The CCI values and the CPI of footwear and clothing are negatively correlated.

Below, scatterplots of each pair of numeric variables are drawn in the lower-left part of the figure. The Pearson correlations are displayed in the upper-right part and are the same as what we found above. The diagonal shows the distribution of each variable.
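
The figure itself is not reproduced in this text. A pairs plot with this layout can be drawn with GGally::ggpairs; the following is only a minimal sketch, assuming the GGally package is installed, and it uses simulated stand-in data (the `toy` data frame is hypothetical, not the EVDS series):

```r
library(GGally)  # assumed to be installed; provides ggpairs()

# Simulated stand-in data, NOT the series used in this report
set.seed(1)
n <- 60
toy <- data.frame(US_DOLLAR = cumsum(abs(rnorm(n, 0.05, 0.05))) + 1.8)
toy$Clothing <- 100 + 22 * toy$US_DOLLAR + rnorm(n, sd = 8)
toy$CCI_Semi_durable <- 105 + rnorm(n, sd = 4)

# Default layout: scatterplots below the diagonal, Pearson correlations
# above it, and each variable's density on the diagonal
pairs_plot <- ggpairs(toy)
print(pairs_plot)
```

With the real data, the same call could be applied to hw3_data_complete[,-1].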

4 Regression Analysis

Our target is the CPI of Clothing, and our independent variables are US_DOLLAR and the CCI of semi-durable goods. The regression summary is shown below:

## 
## Call:
## lm(formula = Clothing ~ US_DOLLAR + CCI_Semi_durable, data = hw3_data_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.384  -9.595   0.643   9.711  34.981 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -90.198     42.713  -2.112   0.0371 *  
## US_DOLLAR          22.431      0.869  25.812  < 2e-16 ***
## CCI_Semi_durable    1.808      0.377   4.797 5.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.23 on 104 degrees of freedom
## Multiple R-squared:  0.8809, Adjusted R-squared:  0.8786 
## F-statistic: 384.6 on 2 and 104 DF,  p-value: < 2.2e-16

From the summary, we can see that both variables are statistically significant. Also, about 88% of the variance in the Clothing CPI is explained by the predictors (R-squared = 0.8809). From the summary table, the regression equation is:

  • Clothing CPI = -90.198 + 22.431 * US_DOLLAR + 1.808 * CCI_Semi_durable
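
To make the equation concrete, here is a small worked example that plugs in the January 2012 observation (US_DOLLAR = 1.84, CCI_Semi_durable = 102.57, Clothing = 127.07), which is the first complete row of the data. The variable names are chosen for illustration only, and the coefficients are the rounded values from the summary:

```r
# Coefficients copied (rounded) from the regression summary
b0    <- -90.198
b_usd <- 22.431
b_cci <- 1.808

# January 2012 values from the complete-case data
us_dollar <- 1.84
cci_semi  <- 102.57
actual    <- 127.07

fitted_value <- b0 + b_usd * us_dollar + b_cci * cci_semi
residual     <- actual - fitted_value

round(c(fitted = fitted_value, residual = residual), 2)
# fitted = 136.52, residual = -9.45
```

For this month the model overestimates the Clothing CPI by about 9.5 index points.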

Now, let’s look at the fitted vs. actual values graph:

From the fitted vs. actual plot, we can observe that the errors are spread roughly evenly around the mean. However, the variance seems to increase when the values exceed 200. Also, because of the seasonality in the target variable, the errors do not seem to be normally distributed, which means our model does not yet explain the CPI well. Keeping this in mind, let’s continue with a more detailed analysis of the residuals:

5 Residual Analysis

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals
## LM test = 67.809, df = 10, p-value = 1.172e-10

When residuals are correlated, some information is left over that should be accounted for in the model in order to obtain better forecasts. The Breusch-Godfrey test above (p-value = 1.172e-10) rejects the hypothesis of no serial correlation up to lag 10. checkresiduals helps us see the pattern and autocorrelation of the residuals. In the first residual graph, the seasonality effect is easier to see. The second chart shows significant autocorrelation at several lags (positive autocorrelation at lags 6-12 and negative autocorrelation at lag 3). The histogram shows how the residuals are distributed compared with a normal distribution.
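
The Breusch-Godfrey test can also be run on its own with lmtest::bgtest, which, as far as I know, is the test that checkresiduals reports for lm objects. Below is a minimal sketch on simulated data, assuming the lmtest package is installed; the `toy_fit` model and its AR(1) errors are made up for illustration and are not the homework model:

```r
library(lmtest)  # assumed to be installed; provides bgtest()

set.seed(42)
n <- 120
x <- seq_len(n) / 10

# Simulated AR(1) errors with coefficient 0.8, so serial correlation is built in
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 5 + 2 * x + e

toy_fit <- lm(y ~ x)

# Null hypothesis: no serial correlation in the residuals up to lag 10
bg <- bgtest(toy_fit, order = 10)
print(bg)
```

A small p-value, as in the output above, means the residuals still carry information from earlier periods.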

5.1 Residuals vs Predictor Plot

Plotting the residuals against the fitted values and against the predictors gives more insight into how adequate our model is. A residual vs. predictor plot is a scatter plot with the residuals on the y axis and the predictor values on the x axis. Ideally, the residuals have a mean of 0 and are scattered randomly around it, that is, they show no significant pattern.
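
A residual vs. predictor plot of this kind can be sketched as follows. This is only an illustration on simulated data, assuming ggplot2 is installed; `toy`, `x` and `y` are hypothetical names. Because the toy model is correctly specified, the points should scatter around the dashed zero line without a pattern:

```r
library(ggplot2)

# Simulated data from a correctly specified linear model (illustration only)
set.seed(7)
n <- 100
toy <- data.frame(x = runif(n, 1, 8))
toy$y <- 10 + 3 * toy$x + rnorm(n, sd = 2)

toy_fit <- lm(y ~ x, data = toy)
toy$residual <- residuals(toy_fit)

# Residuals on the y axis, predictor on the x axis; the smoother helps
# reveal any systematic pattern around the zero line
resid_plot <- ggplot(toy, aes(x = x, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Predictor", y = "Residual", title = "Residuals vs predictor")
print(resid_plot)
```

With the homework data, x would be US_DOLLAR or CCI_Semi_durable and the residuals would come from the fitted regression.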

From the residual plots above, we can conclude that the residuals are not completely random and show a pattern, although their mean seems to be around 0. That means there is some variation in the CPI of Clothing that our model cannot explain well. Although the residuals vs. fitted plot will not give us new information, we can still look at it:

Again, we observe that the variance is larger once the fitted values exceed 200, and that the mean is around 0.

6 Outlier Analysis

One weakness of the regression model is that it is quite sensitive to outliers. Although the earlier time series plots already give a rough picture of potential outliers, it is safer to check for them separately:

From the graph above, we can see that there is no outlier that could distort the regression model significantly.
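
The graph is not reproduced in this text, and the exact diagnostic behind it is not stated. One common way to check whether single observations distort a regression, assumed here purely for illustration, is Cook's distance. The sketch below uses simulated data with one planted influential point; the 4/n cut-off is a rule of thumb, not a formal test:

```r
# Simulated data (illustration only), with one planted influential point
set.seed(3)
n <- 50
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)
toy$x[n] <- 6     # extreme predictor value ...
toy$y[n] <- -10   # ... with a response far from the true line

toy_fit <- lm(y ~ x, data = toy)

# Cook's distance measures how much the fit changes when a point is dropped
cd <- cooks.distance(toy_fit)
threshold <- 4 / n            # common rule of thumb
flagged <- which(cd > threshold)

plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = threshold, lty = 2)
print(flagged)
```

If no observation stands out in such a plot, as reported above for the homework model, the fit is not driven by a few extreme points.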

7 Adding Seasonality Effect to the Model

As shown in the very first time-series plot, our target variable shows a seasonal pattern. This seasonality has also shown up in the residual analyses above. Now, let’s account for it in the model.

ggplot<-ggplot(hw3_data_complete[,c("Date","Clothing")]) +
  geom_line(aes(Date,Clothing))+
  labs( x="Date",y="CPI of Clothing ",title=("CPI of Clothing 2012-2020"))
ggplotly(ggplot)

Above, you can see the peak months and the corresponding CPI values.

##            Date Clothing Footwear US_DOLLAR Personal_INTEREST_RATE   CCI
##   1: 2012-01-01   127.07   142.61      1.84                  19.97 92.50
##   2: 2012-02-01   121.74   137.35      1.75                  19.42 93.80
##   3: 2012-03-01   122.40   141.06      1.78                  18.85 93.19
##   4: 2012-04-01   139.15   157.62      1.78                  18.54 88.71
##   5: 2012-05-01   153.43   163.14      1.80                  18.35 91.93
##  ---                                                                    
## 103: 2020-07-01   245.69   295.13      6.85                  12.37 82.82
## 104: 2020-08-01   239.94   290.93      7.25                  16.54 79.77
## 105: 2020-09-01   240.70   287.89      7.51                  18.48 81.91
## 106: 2020-10-01   259.44   298.99      7.87                  19.50 81.54
## 107: 2020-11-01   262.83   304.78      8.00                  21.08 79.98
##      Household_Fin_CCI CCI-General CCI_Semi_durable month
##   1:             82.32       98.09           102.57     1
##   2:             83.90       99.96           105.52     2
##   3:             82.59      100.56           107.17     3
##   4:             79.78       94.29           103.61     4
##   5:             83.24       99.40           106.55     5
##  ---                                                     
## 103:             74.90       73.79           100.18     7
## 104:             69.21       68.62            99.29     8
## 105:             71.45       71.92           100.37     9
## 106:             67.84       66.26           105.66    10
## 107:             65.99       61.08           105.69    11
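
The month column in the output above is created in a step that is not shown. A minimal sketch of how such a column could be derived from Date, assuming data.table's month() helper (lubridate::month would work as well) and a small hypothetical `toy` table:

```r
library(data.table)

# Three example dates only; the real table is hw3_data_complete
toy <- data.table(Date = as.Date(c("2012-01-01", "2012-02-01", "2020-11-01")))

# Extract the calendar month (1-12) so it can enter the model as a factor
toy[, month := month(Date)]
print(toy)
```

Wrapping the column in as.factor() in the model formula then gives each month its own dummy variable, with January as the baseline.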

Here, we add a month effect to our regression model, so the model now takes into account the differences between months. From the regression summary, we can say that some of the months significantly affect the Clothing CPI. A residual analysis is also conducted below. The autocorrelation due to seasonality is no longer significant, but the Breusch-Godfrey test still rejects the hypothesis of no serial correlation (p-value = 2.051e-09), and lag-1 autocorrelation remains. We will try to handle this in Section 8.

fit <- lm(Clothing~US_DOLLAR+as.factor(month)+CCI_Semi_durable, data = hw3_data_complete)
summary(fit)
## 
## Call:
## lm(formula = Clothing ~ US_DOLLAR + as.factor(month) + CCI_Semi_durable, 
##     data = hw3_data_complete)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.728  -4.507  -0.013   4.460  23.252 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -75.6974    34.5079  -2.194 0.030755 *  
## US_DOLLAR           22.1511     0.6472  34.226  < 2e-16 ***
## as.factor(month)2   -6.6642     4.2323  -1.575 0.118742    
## as.factor(month)3   -7.6141     4.2247  -1.802 0.074745 .  
## as.factor(month)4    6.8097     4.2163   1.615 0.109674    
## as.factor(month)5   17.7998     4.2253   4.213 5.83e-05 ***
## as.factor(month)6   14.1651     4.2499   3.333 0.001234 ** 
## as.factor(month)7    7.4053     4.2263   1.752 0.083036 .  
## as.factor(month)8   -2.9410     4.2430  -0.693 0.489952    
## as.factor(month)9  -13.3232     4.5066  -2.956 0.003944 ** 
## as.factor(month)10   7.6993     4.4038   1.748 0.083704 .  
## as.factor(month)11  16.3804     4.3385   3.776 0.000281 ***
## as.factor(month)12  15.1455     4.3767   3.461 0.000816 ***
## CCI_Semi_durable     1.6432     0.3064   5.364 5.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.94 on 93 degrees of freedom
## Multiple R-squared:  0.9514, Adjusted R-squared:  0.9446 
## F-statistic:   140 on 13 and 93 DF,  p-value: < 2.2e-16
checkresiduals(fit,lag=12)

## 
##  Breusch-Godfrey test for serial correlation of order up to 12
## 
## data:  Residuals
## LM test = 65.668, df = 12, p-value = 2.051e-09
#get fitted values
hw3_data_complete[,fitted:=fitted(fit)]
hw3_data_complete[,residual:=residuals(fit)]
 
p1<-hw3_data_complete%>%
  ggplot(aes(x=fitted, y=residual)) + geom_point()

p2<-hw3_data_complete %>%
    ggplot(aes(x=fitted, y=Clothing)) + 
  geom_point() +
  geom_abline(slope=1, intercept=0)

# only p1 and p2 are defined in this chunk
gridExtra::grid.arrange(p1, p2, nrow=2)

7.1 Prediction

Below, you can see the predicted vs. actual values month by month. From the residual analysis and from the graph below, we can conclude that the variance of the residuals tends to increase starting from 2018. One possible reason for this behavior is the political tension in Turkey during that period (the Pastor Brunson crisis), which directly affected the exchange rates and distorts the predictions.

cols <- c("predicted" = "orange", "actual" = "blue")
ggplot<-ggplot() + 
  geom_line(data = hw3_data_complete, aes(x = Date, y = fitted,color = "predicted")) +
  geom_line(data = hw3_data_complete, aes(x = Date, y = Clothing,color = "actual")) +
  xlab('time') +
  ylab('CPI Clothing') +
  scale_color_manual(values = cols)
ggplotly(ggplot)

8 Residual Analysis - Autoregressive Model

We added the seasonality effect as well as our predictor variables, but still the autocorrelation in the residuals persists. Now, let’s try to account for that in our model.

fit <- lm(Clothing~US_DOLLAR+as.factor(month)+CCI_Semi_durable, data = hw3_data_complete)

acf(fit$residuals)

pacf(fit$residuals)

From the autocorrelation and partial autocorrelation charts above, we can see that the lag-1 value is above 0.6, which is highly significant. This means that each month's residual is strongly affected by the residual of the month before. The relation is easier to see below:

ggplot()+geom_point(aes(x=fit$residuals[-1],y=fit$residuals[-length(fit$residuals)]))+
  labs(x="residuals",y="One month lagged residuals", title="Autocorrelation in residuals")+
  theme_classic()

#adding lag1 column (dplyr::lag shifts the vector by one row; stats::lag would not)
hw3_data_complete[,"lag1_residual"] <- dplyr::lag(hw3_data_complete$residual, 1)
# first order autoregressive model for residuals
autoregressive_fit<-lm(residual~lag1_residual,hw3_data_complete)
summary(autoregressive_fit)
## 
## Call:
## lm(formula = residual ~ lag1_residual, data = hw3_data_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4395  -3.1005  -0.1178   3.0579  20.2399 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -0.14605    0.56500  -0.258    0.797    
## lag1_residual  0.76620    0.07157  10.706   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.814 on 104 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5243, Adjusted R-squared:  0.5197 
## F-statistic: 114.6 on 1 and 104 DF,  p-value: < 2.2e-16
# the first month has no lagged residual, so its predicted residual is set to 0
hw3_data_complete[,residual_pred:=c(0, unname(fitted(autoregressive_fit)))]

hw3_data_complete$new_predict=hw3_data_complete$fitted+hw3_data_complete$residual_pred
cols <- c("predicted" = "orange", "actual" = "blue")
ggplot<-ggplot() + 
  geom_line(data = hw3_data_complete, aes(x = Date, y = new_predict,color = "predicted")) +
  geom_line(data = hw3_data_complete, aes(x = Date, y = Clothing,color = "actual")) +
  xlab('time') +
  ylab('CPI Clothing') +
  scale_color_manual(values = cols)
ggplotly(ggplot)
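
Written out, the correction applied above can be summarized as follows. The symbols are notation introduced here for illustration only: \hat{y}_t is the fitted value of the seasonal regression, e_t its residual, and \hat{e}_t the residual predicted by the autoregressive model. The coefficients are rounded from the autoregressive summary above.

```latex
\begin{aligned}
e_t &= \text{Clothing}_t - \hat{y}_t \\
\hat{e}_t &= -0.146 + 0.766\, e_{t-1} \quad (t \ge 2), \qquad \hat{e}_1 = 0 \\
\hat{y}^{\text{new}}_t &= \hat{y}_t + \hat{e}_t
\end{aligned}
```

Since \hat{e}_t is built from the actual residual of the previous month, the corrected series is a one-step-ahead prediction: it needs last month's observed Clothing CPI.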

Below, you can see the time series chart of the new residuals, obtained after adding the autoregressive model of the residuals. The autocorrelation problem in the residuals is now eliminated, which is a good sign! However, it should not be overlooked that the anomaly in the fall of 2018 still shows up in the model.

hw3_data_complete[,"new_residual"]=hw3_data_complete$Clothing-hw3_data_complete$new_predict
ggplot<-ggplot() + 
  geom_line(data = hw3_data_complete, aes(x = Date, y = new_residual))+
  theme_classic()+
  labs(y="new residuals",title="New Residuals After Including Autoregressive Model of Residuals")
ggplotly(ggplot)
acf(hw3_data_complete$new_residual)